{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://blog.csdn.net/kizgel/article/details/78553009?locationNum=6&fps=1#imblearn-package-study\n",
    "\n",
    "## Imblearn package study\n",
    "### 1.1 Compressed Sparse Rows (CSR)压缩稀疏的行  \n",
    "稀疏矩阵中存在许多0元素，按矩阵A进行存储会占用很大的空间（内存）  \n",
    "CSR方法采取按行压缩的办法，将原始矩阵用三个数组进行表示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.array([1,2,3,4,5,6])\n",
    "indices = np.array([0,2,2,0,1,2])\n",
    "indptr = np.array([0,2,3,6])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "data数组： 存储着矩阵A中所有的非零元素  \n",
    "indices数组：data数组中的元素在矩阵A中的列索引  \n",
    "indptr数组：存储着矩阵A中每行第一个非零元素在data数组中的索引  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[1, 0, 2],\n",
       "        [0, 0, 3],\n",
       "        [4, 5, 6]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy import sparse\n",
    "mtx = sparse.csr_matrix((data, indices, indptr), shape=(3,3))\n",
    "mtx.todense()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " 为什么会有针对不平衡数据的研究？  \n",
    " 当我们的样本数据中，正负样本的数据占比及其不平衡的时候，模型的效果就会偏向于多数类的结果，具体的可参照[官网](https://imbalanced-learn.org/en/stable/introduction.html)利用支持向量机进行可视化不同正负样本比例情况下的模型分类结果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 过采样（Over-sampling）  \n",
    "\n",
    "1. 朴素随机过采样  \n",
    "真的不平衡的数据，最简单的一方法就是生成少数类的样本，这其中最简单的一种方法就是：从少数类的样本中进行随机采样来增加新的样本  \n",
    "RandomOverSampler 函数就能实现上述的功能"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({2: 4674, 1: 262, 0: 64})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "from collections import Counter\n",
    "X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,\n",
    "                          n_redundant=0, n_repeated=0, n_classes=3,\n",
    "                          n_clusters_per_class=1,\n",
    "                          weights=[0.01, 0.05, 0.94],\n",
    "                          class_sep=0.8, random_state=0)\n",
    "Counter(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 4674), (1, 4674), (2, 4674)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from imblearn.over_sampling import RandomOverSampler\n",
    "\n",
    "ros = RandomOverSampler(random_state=0)\n",
    "x_ressampled, y_resampled = ros.fit_sample(X, y)\n",
    "\n",
    "sorted(Counter(y_resampled).items())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以上就是通过简单的随机采样少数类的样本，使得每类样本的比例为1:1:1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 从随机过采样到SMOTE与ADASYN  \n",
    "相对于采样随机的方法进行过采样，还有两种比较流行的采样少数类的方法：\n",
    "    * Synthetic Minority Oversampling Technique(SMOTE)\n",
    "    * Adaptive Synthetic(ADASYN)\n",
    "\n",
    "SMOTE:对于少数类样本a, 随机选择一个最近邻的样本b，然后从a与b的连线上随机选取一个点c作为新的少数类样本  \n",
    "ADASYN：关注的是在那些基于K最近邻分类器被错误分类的原始样本附近生成新的少数类样本  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 4674), (1, 4674), (2, 4674)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from imblearn.over_sampling import SMOTE, ADASYN\n",
    "\n",
    "X_resampled_smote, y_resampled_smote = SMOTE().fit_sample(X,y)\n",
    "sorted(Counter(y_resampled_smote).items())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 4673), (1, 4662), (2, 4674)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_resampled_adasyn, y_resampled_adasyn = ADASYN().fit_sample(X,y)\n",
    "sorted(Counter(y_resampled_adasyn).items())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. SMOTE的变体  \n",
    "相对于基本的SMOTE算法，关注的是所有的少数类样本，这些情况可能会导致产生次优的决策函数，因此SMOTE就产生了一些变体：这些方法关注在最优化决策函数边界的一些少数类样本，然后再最近邻类的相反方向生成样本。  \n",
    "SMOTE函数中的kind参数控制了选择哪种变体\n",
    "    * borderline1  \n",
    "    * borderline2  \n",
    "    * svm  \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 4674), (1, 4674), (2, 4674)]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from imblearn.over_sampling import SMOTE, ADASYN\n",
    "X_resampled, y_resampled = SMOTE(kind='borderline1').fit_sample(X, y)\n",
    "\n",
    "sorted(Counter(y_resampled).items())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. 数学公式  \n",
    "SMOTE算法与ADASYN都是基于同样的算法来合成新的少数类样本：对于少数类样本a，从它的最近邻中选择一个样本b，然后在俩点的连线上随机生成一个新的少数类的样本，不同的是对于少数类样本的选择。  \n",
    "原始的SMOTE:kind='regular'，随机选取少数类的样本  \n",
    "The borderline SMOTE：kind='borderline1' or kind='borderline2'\n",
    "此时，少数类的样本分为三类\n",
    "    * 噪音样本，该少数类的所有最近邻样本都来自与不同于样本a的其他类别\n",
    "    * 危险样本，至少一半的最近邻样本来自于同一类（不同于a的类别）\n",
    "    * 安全样本，所有的最近邻样本都来自与同一个类\n",
    "    \n",
    "这两种类型的SMOTE使用的是危险样本来生成新的样本数据，对于borderline-1 SMOTE，最近邻中的随机样本b与该少数类样本a来自于不同的类，不同的是，对于Borderline2 SMOTE，随机样本b可以是属于任何一个类的样本。\n",
    "\n",
    "SVM SMOTE： kind = 'svm' 使用支持向量机分类器产生支持向量然后再生成新的少数类样本。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
